In silico analysis of the human milk oligosaccharide glycome reveals key enzymes of their biosynthesis

Human milk oligosaccharides (HMOs) form the third most abundant component of human milk and are known to convey several benefits to the neonate, including protection from viral and bacterial pathogens, training of the immune system, and influencing the gut microbiome. As HMO production during lactation is driven by enzymes that are common to other glycosylation processes, we adapted a model of mucin-type GalNAc-linked glycosylation enzymes to act on free lactose. We identified a subset of 11 enzyme activities that can account for 206 of 226 distinct HMOs isolated from human milk and constructed a biosynthetic reaction network that identifies 5 new core HMO structures. A comparison of monosaccharide compositions demonstrated that the model was able to discriminate between two possible groups of intermediates between major subnetworks, and to assign possible structures to several previously uncharacterised HMOs. The effect of enzyme knockouts is presented, identifying β-1,4-galactosyltransferase and β-1,3-N-acetylglucosaminyltransferase as key enzyme activities involved in the generation of the observed HMO glycosylation patterns. The model also provides a synthesis chassis for the most common HMOs found in lactating mothers.

Human milk, aside from its value as a source of food for the new-born infant, has increasingly been shown to have many additional benefits in promoting development and providing protection from diseases. Human milk oligosaccharides (HMOs) are the third most abundant constituent of milk, after lactose, and lipids, with total mean concentrations ranging from 4-30 g/L 1 . The majority of these complex oligosaccharides are based on free lactose. Their compositions and structures have been the subject of several studies dating back to the 1950s, starting with the work of Kuhn 2 and others [3][4][5] , and subsequent characterisations by Kobata, Ginsburg and coworkers [6][7][8][9][10][11][12][13][14][15] . Although indigestible by the neonate, HMOs provide several benefits to it, including antimicrobial action 16,17 , protection from viral pathogens [18][19][20][21] and necrotising enterocolitis [22][23][24] , promotion of a healthy gut microbiota 25,26 and the development of the immune and nervous systems [27][28][29] . HMOs may also play a role in preventing allergic reactions during childhood 30 .
A feature unique to human milk is the heterogeneity of its HMO population, when compared to milk from other species, such as bovine, which are lower in both quantity and structural diversity 31 . Anti-adhesive properties of HMOs rely on competitive inhibition with pathogens for host receptors, as is the case, for example, with the finding that mono-and di-fucosylated oligosaccharides inhibit the binding of cholera toxin to glycosylated receptors of human epithelial cells 32 . By presenting a wide variety of epitopes, the human milk oligosaccharide is able to mask the newborn from such toxins, while simultaneously promoting a colonisation of beneficial gut flora of Bifidobacterial species 33 .
In consequence, there has been much interest towards the synthesis of HMOs industrially, for use as probiotics, for which a number of approaches have been used, including direct chemical syntheses, use of recombinant glycosyltransferase (GT) activities 34 ; transglucosidase reactions, in which glycohydrolases act in reverse to attach monosaccharides to oligosaccharides [35][36][37] ; and through reconstruction of Leloir-type 38 pathways to synthesise the preferred HMO substrate of bifidobacteria, lacto-N-biose I 39 . Although little is known as to the actual enzyme activities expressed in the lactating mammary epithelium 40 , some insights can be gleaned from analysis of the more than 200 HMO structures already characterised, many of which have been employed as model substrates of glycosidases and glycosyltransferases (for example, 41 ).
Based on the observation that mammary gland is secretory in nature, with properties similar to those of mucosal surfaces 42 , a view that has also been supported by the murine case 43 , we assumed that some of the number, comprising two galactosyltransferases (enzymes 1 and 6), three are N-acetylglucosaminyltransferases (4, 10, 11), with three fucosyltransferases (2, 3, 5) and three siayltransferases (7)(8)(9).
As shown in Fig. 1 (Linkage types), HMOs, in common with mucin-type O-glycans and other glycoconjugates, possess non-reducing termini that are based on Gal-β1,3-GlcNAc and Gal-β1,4-GlcNAc motifs, denoted type 1 and type 2 60,61 , respectively, which are formed by the actions of enzymes 6 and 1 on a structure terminating in GlcNAc.
The smallest model network employing all of the activities of the enzymes in Table 2, is shown in Fig. 2.
Bifunctional enzymes. Glycosyltransferases (GTs) are multi-specific, acting on a range of acceptors according to a recognition motif. It is known that many enzymes, including GTs, can display secondary activities, a behaviour sometimes called enzyme promiscuity 63 . Several such enzymes activities are included in the model. For example, in Table 2 the α2FucT enzyme (3) combines the activities of the type 1 and type 2 galactoside 2-α-l-fucosyltransferases, which transfer fucose to either lactose, to form 2′-FL, or lacto-N-biose or LacNAc termini. The activities of the other fucosyltransferases, α3FucT and α4FucT, are recognised separately; however, since α4FucT bears a 3-α-fucosyltransferase towards the glucose of free lactose 64 , this was included as a secondary activity of enzyme 2.  40 extended this number of core structures to 19 (I-XIX). Among the HMOs considered in the present study, we proposed an additional five novel cores to those of the latter classification. The full set of HMO core structures found is given in Supplementary Table S2. Since both earlier classifications provided a large number of distinct sub-populations that were found not to correlate spatially in layouts of the biosynthetic networks (Fig. 4), it made their interpretation difficult. For easier visualisation of the networks, we adopted a simpler classification system based on the initial actions of β3GnT (4), followed by those of the two galactosyltransferases, β4GalT (1) and β3GalT (6), and then the branching N-acetylglucosaminyltransferases, cIGnT (10) and dIGnT (11). This approach divides the oligosaccharides into six classes: two that are linear (types 1 and 2, terminating in β-1,3-Gal and β-1,4-Gal, respectively), and four that combined the actions of cIGnT and dIGnT with subsequent extension by β3GalT and β4GalT. We denoted these four branching possibilities by a/b, where a signifies the type of the 6-linked GlcNAc (upper branch), and b the 3-linked GlcNAc (lower branch), leading to four combinations of types 1 and 2, namely, 1/1, 1/2, 2/1 and 2/2. In common biochemical nomenclature, 0/1 is lacto-N-tetraose (LNT; Core II), 2/1 is lacto-N-hexaose (LNH; Core V), 0/2 is lacto-N-neotetraose (LNnT; Core III) and 2/2 is lacto-N-neohexaose (LNnH; Core VI). Figure 4B highlights the observed structures, and their distribution through the network. All regions of the main network have observed counterparts, with the exceptions of the 1/1 (magenta) or 1/2 (green) branching pattern. For the 206 observed HMOs predicted by the model, a minimal biosynthetic network was constructed by reversing the enzymes of glycosylation and compiling a network of all of the reversed-reversed reactions leading from lactose to observed products. The resulting reduced network is shown in Fig. 4C and D, coloured according to the base-core scheme of Fig. 1. The network with 514 HMOs (nodes), including intermediates, and 966 distinct reactions (edges). In Fig. 4D, only the nodes matching the predicted, experimentally observed, HMOs of Supplementary Table S1 are coloured (206 nodes). On account of the multiantennary nature of many HMOs (167 of the 226 HMOs studied bear at least one β-6-GlcNAc residue), multiple routes to the same product are possible, which results in a lattice-like appearance of the networks. This is also seen in other studies of glycosylation reaction networks, including those of HMOs, N-linked and O-linked glycans 45 Table 2) involved in the biosynthesis of each motif are displayed beneath its structure. A complete set of core structures is in Supplementary Table S2 Figure S1), in which all of the enzymes except the cIGnT activity (10) were used. It was observed that none of the most abundant HMOs were of the base-core 2/2, but were linear 0/1, 0/2, or branched 2/1, according to the classification scheme employed here.
Novel cores/delayed branching. A possible biosynthetic pathway of eighteen of the core structures I-XIX, along with the structures of the five novel cores, is shown in Fig. 5A.
The only omission from the network is core IV, which was not predicted (see "Discussion", Other enzyme activities). Given that model predicts a subpopulation of 1/1 and 1/2 base-core structures, which are not part of the observational dataset, of interest are the existence of the "delayed branching" structures, novo-LNnO 40,76 , which is Core XIII 40 , and its monofucosylated derivative, F-novo-LNnO 40 , since they display the 1/1 branching pattern elsewhere in the molecule. An additional core structure, not included in the HMO library, is F-novo-LNO, which is type 2 on the upper and type 1 on the lower arm of the branch, and 4-α-fucosylated on the GlcNAc of the terminal lacto-N-biose 76 . All three HMOs are predicted by the model, which are defined as 0/2 according to the base-core classification. Their Glycologue structure identifiers (see Supplementary Table S2 Figure 2. Reactions of the model. A reaction network generated by three iterations of the enzyme simulator, starting from lactose. Enzymes reactions are represented arrows leading from an acceptor substrate to an acceptor product, coloured according to the type of monosaccharide transferred: GlcNAc (blue), Gal (yellow), Fuc (red), Neu5Ac (purple). For simplicity, donor substrates and nucleotide products are omitted. The structure underlined in red was not found among the observational data set (Supplementary Table S1).  Table S1), and composition H6N4, is shown in Fig. 5B. It requires the actions of the two GalT enzymes, and is the product of a set of possible sequences of activities, such as (4,1,10,1,4,11,6,2,6,2), when acting on lactose as initial substrate. It is based on lacto-N-neohexaose , LnNH, a core-2/2 structure that has been reported in several studies 48,49,53,54 , and which is itself derived from the core-0/2 lacto-N-neotetraose (LNnT) [48][49][50][51]53,54 . Of note is the (4,11,6) subsequence that is indicative of the formation of the 1/1 motif, even though its base core is 2/2 (cf. Fig. 1: Base core patterns). In Fig. 5B

HMO compositions vs. structures.
As a test of the model, we considered the monosaccharide composition of HMOs as a crude grouping of structures. We entered the compositions of the sub-population defined above into GlyConnect Compozitor 78,79 , a tool that roughly simulates the incremental addition of monosaccharides from a nucleotide sugar to an acceptor. The network of compositions is shown in Fig. 6 where we use the condensed notation (hexose = H, hexosamine = N, fucose = F, sialic acid = S and sulfate = s).
In this process, missing intermediates are inferred as virtual nodes depicted in grey. The HMO structures of our library represent 69 compositions stemming from lactose (Hex 2 = H2) ( Supplementary Fig. S2) and three from LacNAc (H1N1) ) ( Supplementary Fig. S3). In these Supplementary figures, the full network is shown with paths highlighted in orange. This colouring is triggered by hovering the mouse on a node to visualise the reachability of and to other nodes from this point (incoming arrows in orange, outgoing arrows in turquoise). The size of the node reflects the number of publications confirming the presence of the corresponding composition. Candidate structures matching a given composition are suggested as part of the library described above that is stored in GlyConnect 80 and accessible in the dedicated HMO section (https:// glyco nnect. expasy. org/ hmo).
We examined the missing compositions as revealed by the virtual grey nodes, especially those with the greatest influence on the connectivity of the graph. In particular, the absence of intermediary structures between H6N4F1 (matching 8 known structures) and H7N5F1 (matching 3 known structures) led Compozitor to consider H7N4F1 or H6N5F1 as potential connectors. Clearly, if not for those virtual nodes the graph would be disrupted. In the same way, missing data connecting H7N5F2 (matching 5 known structures) and H8N6F2 (matching one known structure) are proposed as H7N6F2 or H8N5F2.
The model validated H6N5F1, as 925 generated structures matched this composition, whereas none were associated with H7N4F1. The enzymes of Table 2 can be divided into subsets based on the type of monosaccharide being transferred. Thus, the set of hexosyltransferases are H E = {1,6}, N-acetylglucosaminyltransferases are N E = {4,10,11}, fucosyltransferases are F E = {2,3,5} and sialyltransferases are S E = {7,8,9}. The graph indicates a preference for GlcNAc first and Hexose second, which could be explained by the mutual interplay of subsets H E and N E , each of whose members, in our model, act on the products of the other. Starting from lactose (H2), we expect members of N E to act first, followed by those of H E , giving H(2 + i)N(i) as the expected composition pattern of cores. By the same reasoning, we would expect a route from H7N5F2 to H8N6F2 to pass through www.nature.com/scientificreports/ virtual node H7N6F2 (N + 1, H + 1), in preference to H8N5F2 (H + 1, N + 1). This was verified by simulations, which predicted 12,955 of the former composition, but none of the latter. Two virtual nodes represent the intermediary linear structures between existing H3, H5 and H7. These are based on preliminary structure assignments from a set of HMOs not included in our validation set, and which are likely to be novel linear chain polygalactosyllactoses (see "Discussion", Other enzyme activities).
The last two virtual nodes are redundant, in the sense that their removal would not disrupt connectivity of the network. However, in their presence, H6N4 (matching 2 known structures) is reachable from 14 nodes in the network and from lactose in particular (blue outgoing arrows in Supplementary Fig. S4). When virtual nodes are not considered then H6N4 is not only reachable through H6N3 or H5N4 and therefore not from the original lactose.
The dataset included in 81 contains 102 HMO compositions some of which representing unexpected extra-large HMOs with a mass greater than 4 KDa. This larger dataset was compared to our library revealing a 70% overlap ( Supplementary Fig. S5). Disregarding the extra-large HMOs, 95% of additional compositions were connected to the main network via virtual or existing nodes. Fourteen virtual nodes were required to complete which for most are new in comparison to the six virtual nodes needed in our library.  Table S1). (D) As C, with only the experimentally observed HMOs highlighted. The position of the starting substrate, lactose, is indicated by a black arrow. Nodes are coloured according to the base core configuration given in Fig. 1: magenta (1/1), blue (2/1), green (1/2), orange (2/2), red (0/1), cyan (0/2), grey (other, unclassified). The locations of DSLNT and inverse-LNnD, within networks B and D, are indicated by red and orange arrows, respectively. Networks were drawn in Tulip 72 using a stress-minimization layout algorithm. www.nature.com/scientificreports/ Epitopes. HMOs display a wide variety of epitopes, as shown in Fig. 7. The expression of antigenic determinants on HMOs is a function of genetic blood group type and the secretor status of the mother [82][83][84][85] . In general, the simulation output matched the percentages of structures with each epitope, except for the Lewis (S)Le x and Le y , which are predicted at higher proportions than in the HMO sample library. Since the model lacks an α3GalNAcT enzyme, the only A-antigen bearing HMO, A-hepta 54 , was not predicted. Lewis b occurs less frequently on the lower arm of an I-branched type-2 structure, a motif observed in one HMO by Remoroza et al. 49 , although not by Wu et al. 54 . Several examples occur among the HMOs synthesised by Prudden and co-workers 48 . Enzyme knockouts. The effect of knocking out enzyme activities in silico, on coverage of the HMOs in the sample library was investigated. With the null knockout represented by 0, each activity, 1-11 of Table 2 was  (4), which when knocked out individually resulted in the lowest coverage values consistently. The lowest coverage, of 1.3%, was obtained with the dual knockout of the α4FucT and β3GnT (iGnT) activities. The distal IGnT activity (11), having a broader specificity than the central IGnT (10), has the greater influence on heterogeneity. The enzymes ranked according to their influence on overall overage of the HMO sample population (the diagonal elements of the square matrix in Fig. 8), are where A > B results in a greater reduction in coverage of the library population when enzyme A is knocked out, compared to the knockout of enzyme B. Maximal coverage was attained with the control (null) knockout, equal to that of the cIGnT knockout.
HMO antigens vary widely between different human populations, but are broadly categorised according to the secretor (Se) and Lewis (Le) status of the mother. Minimal biosynthetic networks corresponding to nonsecretor and/or Lewis-negative mothers were constructed by simulating FUT2 and FUT3 genetic knockouts, corresponding to the elimination of 2-α-and (3/4)-α-l-fucosyltransferase activities of Table 2. A single knockout of α2FucT (3) removes all H antigen, and Le b and Le y epitopes; the resulting network, with all other enzymes available, is shown in Fig. 8B. A FUT3 knockout was modelled through elimination of both the α3FucT and α3FucT activities (enzymes 2 and 5), with the result shown in Fig. 8C. A still smaller population of HMOs is predicted when all three fucosyltransferase activities were removed (Fig. 8D).

Discussion
We have shown that 11 enzyme activities of the model can account for more than 90% of all HMOs analysed in this study, for which we have computed possible reaction networks. We proposed a biosynthetic pathway leading to inverse-LnND, which has revealed a novel HMO core, one of five such cores disclosed by this study. Linking composition data to enzyme activities has enabled us to discriminate between possible routes, where www.nature.com/scientificreports/ unknown intermediates are inferred to exist. While the HMO-Glycologue simulator can be tailored for 2 11 possible phenotypes, the single-and dual-knockouts of enzyme activities enabled us to rank the enzymes according to their degree of influence on the observed HMO population. The influence of β3GnT and β4GalT activities on heterogeneity is not unexpected, as both are involved in the extension of oligosaccharides via LacNAc repeats.
Kinetic and genetic regulation of HMO core types. That the majority of branched HMOs fall into the two 2/1 and 2/2 base-core categories could be explained by the higher activity of EC 2.4.1.38 (1) towards GlcNAc-β1,6-Gal than to GlcNAc-β1,3-Gal 86 . This would suggest that a kinetic competition exists between the two galactosyltransferases (1,6) that favours type-2 termination on the 6-linked arm. Kinetic competition might also explain the asymmetry in the existence of LacNAc repeats among the sample population, which appear only on the 6-linked GlcNAc. Of the 226 HMOs in Supplementary Table S1, there are no occurrences of LacNAcextended (type-2) terminations of "lower" branch and three occurrences of lacto-N-biose extended type 1; of www.nature.com/scientificreports/ the "upper" branch, there are 28 extended with lacto-N-biose (type 1) and 53 that are LacNAc-extended (type 2). Since changes to HMO concentration are observed to vary over the course of lactation 50,87,88 , owing to regulation of the expression of the genes coding for these enzymes, as well as their location in the Golgi, a spatiotemporal separation of the different galactosyltransferase activities 1 and 6 may account for the synthesis of inverse-LNnD (Fig. 5).
Disialyl-lacto-N-tetraose. The disialylated stucture, disialyllacto-N-tetraose (DSLNT) has previously been shown to protect neonatal rats 89 from necrotising enterocolitis (NEC), a serious inflammatory disease of the intestinal wall, to which premature infants especially are prone 90 . Milk from mothers containing DSLNT in elevated quantities has been shown significantly to reduce the likelihood of NEC 23 , and its absence from the milk of the mother may be useful as a predictive marker for the disease 22,24 . Our model was able to predict the sequence of enzyme activities leading to this structure, in a four-step process from lactose: (4,6,8,9) (cf. Table 2), providing an in silico verification of the in vitro synthesis of DSLNT achieved by Prudden et al. 48 . This HMO, which has the composition H3N1S2 (cf. Fig. 5), being both terminated and decorated by sialic acid, appears to be unique to human milk: it is not detected, or only in trace amounts, in other species 91 . The mechanism by which milk containing DSLNT acts to protect the infant from NEC is still, to our knowledge, an open question.  Table 2 on the percentage coverage of the HMO structure library (Supplementary Table S1). In the null/null knockout all enzymes remain active. Colours range from purple (minimal coverage: 1.3%) to yellow (maximal coverage: 91.1%), corresponding to the null/null knockout (0/0) and the α4FucT/β3GnT knockout (2/4), respectively.   1 (7) is generic, acting on β-galactosyl termini, we modelled its major activity, which is towards the type-2 LacNAc acceptor 64 . The enzyme from goat and bovine colostrum has a secondary activity towards type 1 , which might explain the three HMOs in which this motif appears, [ Supplementary Table S1 for IUPAC and GlycoCT notations). We conclude that there exist two, as-yet uncharacterised, 6-O-sulfotransferase enzyme activities of human milk, which are specific for lactose and LNTri II, since no other sulfated compounds were found. All four non-predicted sulfated HMOs have non-sulfated counterparts that were predicted by our model.   51 , could not be predicted owing to the lack of a 3-α-fucosyltransferase acting on galactose. The corresponding 2-α-fucosylated structure, known as FDS-LNH I 47 and DSF-LNH II 40,49 is predicted within the system described here. The existence of A-hepta, discovered by Wu et al. 54 , is likely to be the product of EC 2.4.1.40, glycoproteinfucosylgalactoside α-N-acetylgalactosaminyltransferase, acting on the H antigen. Since only one occurrence of the antigen was found in the sample dataset, this enzyme activity was excluded from the model.
Following the publication of a mass spectral reference library for HMOs based on the NIST Standard Reference Material (SRM) 1953 dataset 49 , the structures of several unknown HMOs were inferred from the original library by means of a bootstrapping approach 92 . Out of 78 of novel HMOs (Table S4 of 92 ), 48 (61.5%) were predicted by HMO-Glycologue with the 11 enzymes of the current model active. Owing to preliminary structural assignments and significantly lower coverage, the data was excluded from the library (Supplementary Table S1). Nevertheless, a review of those structures which were not predicted by the model is instructive, since the existence of some of them might be explained by known enzyme activities of human glycosylation. In what follows, Rn refers to HMO with index number n within the cited dataset 92  VT, is predicted within O-Glycologue. If the two enzymes, C1GalT and C2GnT, were co-expressed in human mammary gland, then such 3-galactosylated "Core-2 HMOs" would be expected to be abundant, but they are not. The [L3]L4 motif is observed in only one other structure, N4010b, from this and a previous study 49 , although structure R2, Galactosyl-FpLNnO, might be another candidate. Instead, a possible explanation for N4010b, [[L4Y6][L3]L4G], is that it is the product of C2GnT acting on 3′-GL, formed by the galactoside 3′-β-galactosyltransferase referred to above, followed by β4GalT. The hypothetical pathway is illustrated in Fig. 10. The structure N4010b, which is also designated HMO core IV, or lacto-N-novopentaose I 40 , and known to occur in milk from marsupial species 93,94 , along with further products derived from it 95 . By similar reasoning, Novo-LNP I (R7) could be a "Core-4" HMO. The enzymic origin of these structures therefore remains to be elucidated.
From our analysis of HMO compositions, our results concluded that certain virtual nodes were predicted within the system, but not others (see "HMO compositions vs. structures"). For instance, composition H8N5F2 that is virtual in our library network (Fig. 6) is real in the Porfirio et al. 81 dataset, thereby supporting the selection of H8N5F2 over H7N6F2 as a valid path connecting H7N5F2 and H8N6F2. If α-galactosyltransferase enzymes are active during lactation, it might also explain the higher ratio of hexose to GlcNAc of such compositions.
The dodecaose series 92 are differentially fucosylated, and all of them would be predicted by the model were it not for the presence of the repeating L3Y3 units (L3Y3L3Y3) on the lower branches of around half of them. This unusual sequence, reported in O-glycans of human colonic 97,98 and gastric 99 mucin, are not formed by EC 2.4.1.149, which is instead responsible for the formation of polyLacNAc. Their presence would suggest the activity of an unknown β3GnT enzyme activity that recognises only the terminal [L3Y3*, as acceptor, since the type-1 extended upper arm, L3Y3L3Y6, is not observed. Another sequence that is novel is L4Y3L3Y3, i.e. type-1 structures extended by LacNAc to form type 2, which would infer that β4GalT does not recognise the [Y3L3. We note that the same motif exists in DF-para-LNO I (R58), to which the structure identifier [[f2]L3[f4] Y3L3Y3L4Y3L4G] was assigned.
The novel Iso-LNnO oligosaccharide, which features a terminal Gal 6-linked to a preceding GlcNAc, would require a β6GalT enzyme to form the substructure L6Y. The β(1 → 6)-galactosyltransferase referred to above, www.nature.com/scientificreports/ which forms 6′-GL, might also be responsible for the addition of Gal to GlcNAc, although, to our knowledge, this functionality has not yet been demonstrated. The polyGal structures pentagalactosyllactose, heptagalactosyllactose and octagalactosyllactose R71-R73; might be initiated by the enzyme responsible for early heparan/chondroitin biosynthesis, galactosylxylosylprotein 3-β-galactosyltransferase (EC 2.4.1.134), if that enzyme were active towards glucose in addition to xylose. If the di-, tri-and tetra-galactosyllactose precursors of these larger structures exist, they were not reported, nor included in the NIST mass spectral HMO reference library.

Conclusions
As samples produced in various geographical, nutritional and health conditions accumulate, it is very likely that more HMOs will be identified. In light of the results presented here, we envisage that the Glycologue HMOenzyme simulator will be extended, or adapted, as our knowledge of the enzymes and their substrate specificities improves. The model presented here is a pathway model, which does not account for abundances of individual oligosaccharides, which will require additional knowledge of the cellular environment, and mechanistic details of the enzymes involved, and their localisation within the cell. Such models have already been proposed for the biosynthesis of N-linked 100 and O-linked 101 glycans. The development of a mathematical model based on the core network of Fig. 2, for example, might help to elucidate the relative contributions of the enzyme activities proposed by the current network model, and the overall expected distributions of HMOs given in Fig. 4B and D.
Our analysis of the HMO glycome has raised several questions on individual and combined enzyme activities. Since the model is based on the assumption that O-linked glycosylation enzymes are involved also in milk biosynthesis, our results which will require experimental verification using free-oligosaccharide acceptors. Some questions remain to be answered, such as whether it is possible that the unknown I-branching enzyme, dIGnT, is less specific than cIGnT, and if it can use either type-1 (β3-Gal) or type-2 (β4-Gal) HMOs as acceptors. It is hoped that this analysis will promote further examination of these enzymes, including those yet to be characterised, such as the galactoside β-galactosyltransferases. With its predictive power, the model can also be considered as a guide for experimental synthesis of HMOs, which would potentially enable testing with specialised microarrays.

Methods
Formal language. The method is based on a formal language, which uses a single-letter code for the monosaccharides, as shown in Table 1, and a set of transformation rules that add one monosaccharide at a time, using a regular-expression based pattern matching to model the enzyme activities. A software implementation of the method, Glycologue, acts iteratively on an initial acceptor-substrate, passing it to each enzyme in turn, and accumulating a set of acceptor-products. The pool of novel acceptor-products become the substrates at the next iteration, until either no new products are formed, or a maximum number of iterations set by the user has been attained. Since extension of oligosaccharides occurs principally by means of LacNAc (Gal-β1,4-GlcNAc), simulations could be limited by the number of GlcNAc residues incorporated. The reaction network was deemed to have closed at iteration i when no further products were added at the next iteration, i + 1. Simulations could also be limited by supplying a target composition value such as H4N3F1S1 (4 Hex, 3 HexNAc, 1 × dHex, 1 × Neu5Ac), such that enzymes would act on a substrate only if its composition did not exceed the prescribed value of any class of monosaccharide. The number of possible oligosaccharide structures matching that composition was then calculated. Table 2 lists the biosynthetic enzymes predicted by the model, with an index number, 1-11, the EC number, where available, a short name, a longer accepted name and a reaction pattern. The reactions in Table 2 are based on activities of enzymes already classified within the IUBMB Enzyme List, or from the cited references, wherever an EC number is not available. In reaction patterns, asterisks act as a wildcard character, to denote portions of the molecule of indeterminate length. The glossary in Table 2, footnote b, shows some of the assumptions implicit to the model, such as the anomeric configuration of the donors, from which it can be inferred which enzymes invert or retain the stereochemical configuration of the donor during incorporation into the acceptor. Reactions could, in addition, be limited by Boolean conditional regular expressions. In the case of two I-branching enzymes, 10 and 11, it was assumed that these enzymes would not be active towards 3-fucolactose (3-FL) substrates.

Enzymes simulated.
Glycosyltransferase reaction-pattern classification. The types of reaction catalysed are classified according to a limited number of transformation patterns. We consider the default mode of action to be the extension of a linear oligosaccharide, by where x and y are monosaccharides, Ax is the nucleotide-sugar donor, and yB the acceptor-substrate.
The formation of a single branch along a linear chain is described as decoration, where the pattern is and we have assumed that [x]y is a shorthand for *[x]y*B, the asterisks acting as a wildcard character. Double branches are used to form symmetric core structures, such as the trimannosyl core of N-glycans, or O-linked glycan cores based on GalNAc: (1) Ax + yB = xyB + A www.nature.com/scientificreports/ Capping of branches and linearly extended chains is achieved through termination, of which sialylation is a typical example. Modification of monosaccharides, such as by the actions of sulfotransferases, follows the same pattern as decoration (2). Glycologue structure identifiers order branches by linkage position, writing the branch with the lowest linkage first, reading from right to left. Modifiers are written before sugars units, and multiple modifiers on the same sugar are again ordered by linkage position, from lowest to highest, reading right to left. Boolean conditionals can be applied to enzymes, to prevent action in the presence or absence of a particular recognition motif.
This classification, and the numbering rules, enables the assignment of a unique Glycologue structure identifiers to HMOs with determined structures. Glycosyltransferases 1, 4 and 6 are involved in extension, and the sialyltransferases in termination, by pattern (1); the fucosyltransferases 2, 3 and 5, and ST6GlcNAc (9) are involved in decoration according to reaction pattern (2), and GTs 10 and 11 form double branches according to pattern (3).
HMO structure prediction. The activities of the enzymes could be reversed, and all reaction paths leading from a given HMO to lactose determined. For any set of such initial substrates supplied to the reversed simulator, the complete collection of such paths leading from lactose provided a minimal biosynthetic reaction network. Individual paths are represented by ordered sequences of enzyme activities. For example, the formation of LST b, which has the structure identifier [L3[S6]Y3L4G], has sequence (4,6,9), when starting from lactose. As there can be multiple routes to a given HMO, the networks generated in both the forward and reverse directions possess a lattice-like structure.
Web application. A web application interface to the HMO-enzyme simulator, the set of experimentally determined HMOs used as validation and the source code of the simulator as a Python 3 script, are available at https:// glyco logue. org/m. The Glycologue family of simulators 46,96,102 supports import and export of IUPAC short form and GlycoCT condensed formats, along with the native Glycologue structure identifiers, while export as Linear Code is also provided for individual structures. Networks can be exported as SBML, with GlycoCT-XML embedded as annotations of the nodes.

Data availability
The HMO structures analysed in this study are available as CSV files (Supplementary Tables S1 and S2). The enzyme model simulator is available at https:// glyco logue. org/m/. GlyConnect Compozitor is available at https:// glyco nnect. expasy. org/ compo zitor/.